Install R and RStudio

From NIWA Apps install R and RStudio; I found both under Programming.

R is a programming language and RStudio is a widely used Graphical User Interface (GUI) that runs R in the background.

To check everything is working, open RStudio (if you use Windows, use the Search Windows box to locate it).

Once opened you should see something like this

The section with the cursor is called the Console. To test all is working, type 3 + 4 where the cursor is and press Enter. If all has gone well you should see something like this

You can now close RStudio down as you are ready for the first session.

Basics

Some terminology

  • Console: the text-based interface between you and the computer. This is where you tell the computer what to do.

  • Script: the code file you write telling the computer what to do.

  • Run: the act of executing your code; this is when the computer actually carries out your instructions.

  • Syntax: how the language is written and what characters mean.

  • Function: a named rule or procedure that carries out a calculation, e.g., mean().

  • Argument: an input (an object or option) you give to a function so it can carry out its calculation.

  • Package: a collection/suite of functions/commands. If you want to use a specific package you must first install it: run install.packages("name_of_package") in R (you only need to do this once, think of it as downloading a book to your digital library). To then use the functionality offered by that package you need to call the package so that R knows where to look, kind of like requesting a specific book from your library. To call up the package run library(name_of_package) in R (you need to do this once in each R session you want to access the package’s functionality).
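As a concrete sketch, here are those two steps for the ggplot2 package (which we use later in these sessions):

```r
## install once per machine (downloads the package to your library)
install.packages("ggplot2")
## load once per R session (makes the package's functions available)
library(ggplot2)
```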

Don’t have a dirty workspace

In RStudio. Tools >> Global Options

  • Uncheck Restore .RData into workspace at startup
  • Uncheck Restore most recently opened project at startup
  • Uncheck Always save history (even when not saving .RData)
  • Set Save workspace to .RData on exit: to Never

Always start R with a blank slate

Basic operations

## Start a line with hash and you can write your own comments
## This helps future you understand what present you was doing.
## To add two numbers
3 + 4 ## When run the number 7 should appear in your console
## [1] 7
## Multiply two numbers
3 * 4
## [1] 12
## "Create" an object x and give it the value 3
x <- 3
## So what is x? Type x and find out
x
## [1] 3
## What is x - 1?
x - 1
## [1] 2
## To create a vector of numbers 1--5
y <- c(1,2,3,4,5) 
## or
z <- 1:5
## Print them to the console
y
## [1] 1 2 3 4 5
z
## [1] 1 2 3 4 5
## multiply y by x
y*x
## [1]  3  6  9 12 15
## the mean of y
mean(y)
## [1] 3

Errors

Honestly they are in English!

Try typing your name into the console

Charlotte ## An Error should be printed to your console
## Error in eval(expr, envir, enclos): object 'Charlotte' not found

Why is this? It’s because Charlotte doesn’t exist in R because I haven’t created it.

Armed with a little common sense most Errors are easy to decipher. If not ask Google…

Introduction to R (First practical session)

Setting your “working directory”

  • In RStudio follow Session >> Set Working Directory >> Choose Directory…

  • Or in your console run setwd("path/to/your/directory")

Open an .R script

  • File >> New File >> R Script

This is where you’ll type your commands and then “Run” them in the console.

HELP

There are a few ways to ask R for help without resorting to using Google. Try running the following code.

?mean ## will open up a screen with a description of the function mean()
args(mean) ## prints the arguments the function takes to the console

Reading in data

Most data files can be read into R using a single line of code. If you haven’t set your working directory (see above) to the folder where your data file is then you must put in the full path to the file.

Functions to read in the two main types of data file extensions are listed below. In both instances the data are read in and saved as the R object data.

## read in a .csv
data <- read.csv("path/to/your/file.csv")
## read in a .txt file
data <- read.table("path/to/your/file.txt")
## NOTE on Windows machines you can use / or doubled backslashes \\ (e.g., "C:\\Users\\Charlotte\\Documents\\file.csv"), but not single backslashes

In addition, RStudio offers a point and click version: Import Dataset (top right panel).

Let’s read in a .csv file. First download this file into your working directory (right click and “Save As”). The file includes some Ecoli data.

## read the file into R
ecoli <- read.csv("ecoli.csv") ## replace the .csv file name with the file name you've chosen

Reading and Writing .csv files with R

Inspecting and accessing data

Let’s look at the first few rows of the data.

head(ecoli)
##        date run replicate location value       site
## 1 3/08/2017   1         1      LBN    15 NorthShore
## 2 3/08/2017   1         2      LBN    34 NorthShore
## 3 3/08/2017   1         3      LBN    13 NorthShore
## 4 3/08/2017   2         1      LBN    12 NorthShore
## 5 3/08/2017   2         2      LBN    14 NorthShore
## 6 3/08/2017   2         3      LBN    15 NorthShore

To access a specific element of the data we use the $ character. For example, to access the column of sites you would use ecoli$site. Alternatively you can use square brackets [] to access elements of ecoli. For example, to access the fourth row you would type ecoli[4,]. You could also access the site column using ecoli[,6] as this is the 6th column. Note that to access a certain row the number goes before the comma, and the column number goes after. A blank on either side of the comma means access all of that row or column.
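For example (assuming the ecoli data read in above), all of the following are valid ways to access its elements:

```r
## assuming the ecoli data read in above
ecoli$site    ## the site column, using $
ecoli[4, ]    ## the fourth row (all columns)
ecoli[, 6]    ## the sixth column, i.e., site
ecoli[4, 6]   ## the single value in row 4, column 6
```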

Run the following commands and see if you can work out what they’re doing. Use the ## to write your own comments in your script.

str(ecoli)
table(ecoli$site)
summary(ecoli$value)
median(ecoli$value)
boxplot(ecoli$value ~ ecoli$site)
?mapply ## what do you expect this function to do?
mapply(summary, ecoli[,5:6])
?aggregate
aggregate(ecoli$value, by = list(ecoli$site),mean) ## what does this give you?

Save your script

File >> Save or Ctrl S

Everything that really matters should be achieved through code that you save.

Week One “Homework”

The following activities rely on you having read in the ecoli data you downloaded in class as the object data (not ecoli as above).

  • calculate the mean of the value column and save it as the object mean.val [HINT: to save the median of the value column as median.val you might use median.val <- median(data$value) ]

  • find the 0.99 quantile of the value column and save it as the object nn.quantile [HINT: running quantile(data$value, probs = 0.95) will calculate the quantile corresponding to a probability of 0.95]

  • install and then load the package devtools [Hint: to install the package mgcv you would run install.packages("mgcv") and to load it you might run library(mgcv)]

  • create an object called loc.max containing the maximum values of value at each location [HINT: use the aggregate and max functions]

If you want to “mark” your homework run this command devtools::source_url("https://raw.githubusercontent.com/cmjt/scripts/master/markR/week_one.r", sha1 = "262bfa2514223d04ca68eaff5f306b8d33c87b30").

Didn’t think that was enough?

Plotting and exploratory data analysis (Second practical session)

Some of this session will use the Ecoli dataset we used last week. If you haven’t already done so please download this file (right click and “Save As”).

To load the data into your R session either use the Import Dataset (top right panel) or the R command read.csv() covered here

Let’s have a look again at the data. Try running all or just a couple of the following commands (these assume your data is called ecoli)

str(ecoli) ## prints a summary of the components of the dataset to screen
head(ecoli) ## prints the first 6 observations to the screen
tail(ecoli) ## prints the last 6 observations to the screen
head(ecoli,n = 25) ## prints the first 25 observations to the screen

Function arguments

Last time we learnt that an R function (a type of calculation or rule) typically has arguments (options or data you want to use in the calculation).

It is worth noting that:

  • R function arguments can be matched positionally or by name. Therefore the following calls to mean are equivalent: mean(ecoli$value), mean(x = ecoli$value), mean(x = ecoli$value, na.rm = FALSE), and mean(na.rm = FALSE, x = ecoli$value). Try it for yourself.

  • You can mix positional matching with matching by name. When an argument is matched by name, it is taken out of the argument list and the remaining unnamed arguments are matched in the order that they are listed in the function definition. For example, try mean(na.rm = FALSE, ecoli$value).
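To see this for yourself, compare the following calls (assuming the ecoli data from above); all four should return the same number:

```r
## all four calls return the same value
mean(ecoli$value)                    ## positional matching
mean(x = ecoli$value)                ## matching by name
mean(x = ecoli$value, na.rm = FALSE) ## both arguments named
mean(na.rm = FALSE, ecoli$value)     ## na.rm matched by name, the rest positionally
```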

Basic descriptive (summary) statistics

Measures of centrality

mean(ecoli$value) ## mean value
## [1] 142.8458
median(ecoli$value) ## median value
## [1] 45

Measures of spread

range(ecoli$value) ## range
## [1]   7 997
var(ecoli$value) ## variance
## [1] 45747.86
sd(ecoli$value) ## standard deviation
## [1] 213.8875
min(ecoli$value) ## minimum
## [1] 7
max(ecoli$value) ## maximum
## [1] 997

Other descriptive statistics

quantile(ecoli$value, probs = c(0.25,0.75)) ## quantiles corresponding to the given probabilities
## 25% 75% 
##  15 161
summary(ecoli$value) ## summary of the data
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     7.0    15.0    45.0   142.8   161.0   997.0
fivenum(ecoli$value) ## Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum) 
## [1]   7  15  45 161 997

Summary statistics for multiple columns

For the following exercise we will be using data already “in” R. Run data(airquality) to access this data followed by ?airquality to read about what this dataset contains.

Try using the mapply() function touched on in this section coupled with the summary statistics above to best summarize the airquality data.

## getting means of all columns
mapply(mean, airquality) ## first argument is the function to apply, second is the data frame; the function is applied to each column
##     Ozone   Solar.R      Wind      Temp     Month       Day 
##        NA        NA  9.957516 77.882353  6.993464 15.803922
## Now try
mapply(mean, airquality, na.rm = TRUE) ## the extra na.rm = TRUE argument is passed on to mean(), so NA values are ignored when evaluating the function
##      Ozone    Solar.R       Wind       Temp      Month        Day 
##  42.129310 185.931507   9.957516  77.882353   6.993464  15.803922

Plotting data

For the following exercise we will again be using data already “in” R, as well as both the airquality and Ecoli data above. Run data(npk) to access this data followed by ?npk to read about what this dataset contains. Now try printing the data to the screen by typing npk (don’t worry, there are only 24 observations). What do you see? What does table(npk[,2:4]) do?
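The steps described above can be run as:

```r
data(npk)          ## load the built-in dataset
?npk               ## open its help page
npk                ## print all 24 observations to the console
table(npk[, 2:4])  ## cross-tabulate the N, P, and K treatment indicators
```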

The humble plot() function

Run the following and notice the different plots produced. This is because the plot() function tries its best to work out the type of data you are passing it and makes the most appropriate plot (mostly…)

plot(ecoli$value)
plot(ecoli$location)
plot(ecoli$date, ecoli$value)
pairs(airquality, panel = panel.smooth, main = "airquality data")

Other plots

pairs(airquality, panel = panel.smooth, main = "Airquality data") ## look at possible relationships
hist(npk$yield, main = "Histogram of yield") ## look at distribution using a histogram
barplot(table(ecoli[,4])) ## bar plot of counts of ecoli locations
boxplot(npk$yield ~ npk$block, ylab = "NPK yield", xlab = "Block number") ## boxplot of NPK yield by block (NOTE the ~ here)
## What about a density and rug plot (this requires two calls)
plot(density(ecoli$value), main = "Ecoli density") ## plot a smooth density of the ecoli values
rug(ecoli$value) ## add a rug plot to it

You’ll have noticed a few times we have customized the titles or axis labels of the plots above. Try finding out what other customization is possible to each plot. [HINT: use ?plot to see what arguments there are. What does plot(npk$yield, type = "l", col = 3) do?]

Introducing the ggplot2 package for plotting.

To install the package ggplot2 run install.packages("ggplot2") and to load it run library(ggplot2)

The ggplot2 package is a suite of tools designed for data visualization. Some people think this package makes it easier to visually assess your data. Using ggplot2 requires a slightly different syntax to base R. As you’ll notice when running the code below, the + symbol is used to “add” things to the plot

library(ggplot2)
ggplot(aes(x = block,y = yield),data = npk) + geom_boxplot() + ggtitle("NPK") ## boxplots of NPK by block. Is there a difference between blocks?

ggplot(aes(x = site,y = value, color = as.factor(run)),data = ecoli) + geom_boxplot() + ggtitle("Ecoli values") + labs(color = "run number") ## boxplots of ecoli split by site and coloured by run number. Is there a difference between sites?

ggplot(data = airquality, aes(x = Temp, fill = as.factor(Month))) + geom_density(alpha = .3)  + labs(fill = "Month number") ## density plots NY temperature by month

Try changing some of the arguments and see what happens (e.g., what does ggplot(aes(x = site,y = value, color = location),data = ecoli) + geom_boxplot() + ggtitle("Ecoli values") show you?)

In each of the plots above some values are shown split by some condition (e.g., month, location). Do there appear to be differences? How might you assess if these differences are “significant”? Come along next time to find out.

Week two “Homework”

Make a barplot showing the means of the Ecoli values at each site, with error bars showing +/- 2 times the standard error around the mean (two possibilities shown below). NOTE: the standard error of the mean of some data can be calculated as the standard deviation divided by the square root of the number of observations, i.e., \[ se = \frac{sd}{\sqrt{n}}. \]

HINT: You will need to calculate the means and standard errors for each site. Research the points() and arrows() functions to add things to your barplot. To use the x coordinates of the bars in a barplot you will need to save the barplot as an object (e.g., running bar <- barplot(table(ecoli[,4])) will create a vector of the x coordinates of each bar called bar.)

mns <- aggregate(ecoli$value, by = list(ecoli$site), mean) ## calculate means for each site
ses <- aggregate(ecoli$value, by = list(ecoli$site),  sd)$x/sqrt(aggregate(ecoli$value, by = list(ecoli$site),  length)$x) ## calculate standard errors for each site
plot.data <- data.frame(sites = mns$Group.1,means = mns$x,ses = ses, upper =  mns$x + 2*ses, lower =  mns$x - 2*ses) ## create data frame of data (optional for ggplot)


## Using base graphics
b.cent <- barplot(mns$x, names.arg = mns$Group.1,axes = FALSE, ylim = c(0,210)) ## need to "create" the barplot to return centre locations of each bar
points(b.cent,mns$x, pch = 20,xpd = TRUE) ## add the mean values
arrows(b.cent, mns$x - 2*ses, b.cent, mns$x + 2*ses, lwd = 2,code = 0) ## add the error bars

## Using ggplot2
library(ggplot2)
ggplot(plot.data, aes(x = sites, y = means)) + 
    geom_bar(position = position_dodge(), stat="identity") +
    geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) + ylab("Mean of Ecoli") + xlab("Sites")

Try some more data visualization with ggplot2

Hypothesis testing (Third practical session)

Confidence intervals

A confidence interval can be thought of as a plausible range within which we might expect the true value of some unknown population parameter to lie. The level of “confidence” associated with this interval typically reflects the proportion of time we would expect an interval to contain the true value of the parameter. For example, if we took 100 samples, calculating the 95% confidence interval for the mean each time, we would expect the intervals to contain the “true” mean 95 times.

Bootstrapping to calculate a confidence interval for the population mean.

Using the airquality data we’re going to bootstrap for the confidence interval of the population mean

y <- airquality$Temp ## create a vector of temperature values called y
N <- 10000 ## The number of times to resample
Y <- numeric(N) ## Create a vector of 10000 zeros, this is where we'll store the resampled means
## Note R is case sensitive so Y and y are different
## Code to carry out the resampling
for(i in 1:N){
  ## calculate the mean from a sample
  ## of size = length y from y
  ## with replacement
  Y[i] <- mean(sample(y,length(y), replace = TRUE))
}
## plot a histogram of these bootstrap samples
hist(Y)

## print a summary of these bootstrap values
summary(Y) ## note as we are resampling randomly the values won't be exactly the same for everyone.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   75.21   77.38   77.88   77.89   78.41   80.69

We can see that the mean of Y is roughly the same as the original sample (77.882), but there is a lot less scatter. A 95% bootstrap confidence interval for the expected air temperature is given by

quantile(Y,probs = c(0.025, 0.975)) ## i.e., 2.5% in lower tail and 2.5% in upper tail
##     2.5%    97.5% 
## 76.35294 79.38578

So we can say, with 95% confidence, that the expected air temperature (in NY over the dates in the dataset) lies somewhere between 76.353 and 79.386.

Loops in R

Bootstrapping and the t-distribution

The sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger; this is otherwise known as the central limit theorem.

Using the standard error of an estimate we can calculate a 95% confidence interval (CI). For the population mean a 95% confidence interval is calculated as \[\bar \mu \pm t_{97.5}\: se_\mu\] where \(\bar \mu\) is the sample mean, \(se_\mu\) is the standard error of the mean, and \(t_{97.5}\) is the t-multiplier (the 97.5% quantile of the t-distribution). This multiplier is based on the confidence level desired and depends on the sample size. A standard error of some estimated parameter is a measure of the accuracy of the estimate. The standard error of a mean, \(se_\mu\), can be written as \[ se_\mu = \frac{sd}{\sqrt{n}} \] where \(sd\) is the standard deviation of the sample and \(n\) is the sample size.

## Now let's compare the bootstrap CI to the one based on the t-distribution
## bootstrap 
quantile(Y,probs = c(0.025, 0.975)) ## i.e., 2.5% in lower tail and 2.5% in upper tail
##     2.5%    97.5% 
## 76.35294 79.38578
## t-distribution
mean(y) + c(-1,1)*qt(1 - 0.025, df = length(y) - 1)*sd(y)/sqrt(length(y)) ## can you work out what each bit of this code is doing?
## [1] 76.37051 79.39420

From this we can see that the bootstrap and t-distribution confidence intervals are essentially equivalent.

One sample t-test and a NULL linear model

## a one sample t-test
t.test(y)
## 
##  One Sample t-test
## 
## data:  y
## t = 101.78, df = 152, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  76.37051 79.39420
## sample estimates:
## mean of x 
##  77.88235
##  fit the NULL model 
model.null <- lm(y ~ 1) ## an R object of this fitted model is created
summary(model.null) ## print summary (NOTE the estimate of the population mean is called "Intercept")
## 
## Call:
## lm(formula = y ~ 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.882  -5.882   1.118   7.118  19.118 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  77.8824     0.7652   101.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.465 on 152 degrees of freedom
## let's calculate the confidence interval
confint(model.null)
##                2.5 %  97.5 %
## (Intercept) 76.37051 79.3942

From the output we can see that the one sample t-test is an example of a linear model.

NOTE: A paired comparison is equivalent to a one sample t-test on the differences. This arises when two measurements are taken from the same person, say, which is an example of a repeated measures (measured twice) study.
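As a quick sketch with some made-up numbers (the values below are hypothetical, purely to illustrate the equivalence):

```r
## hypothetical data: a measurement on the same 5 people, before and after
before <- c(12, 15, 11, 14, 13)
after  <- c(14, 16, 12, 17, 13)
## a paired t-test...
t.test(after, before, paired = TRUE)
## ...gives the same result as a one sample t-test on the differences
t.test(after - before)
```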

Independent comparisons (a two sample t-test)

t.test(value ~ site, data = ecoli)
## 
##  Welch Two Sample t-test
## 
## data:  value by site
## t = -13.972, df = 2550.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -121.70841  -91.75074
## sample estimates:
## mean in group NorthShore  mean in group Waitemata 
##                 88.76602                195.49559
## Can this be formulated as a linear model?
summary(lm(value ~ site, data = ecoli))
## 
## Call:
## lm(formula = value ~ site, data = ecoli)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -188.50 -119.00  -68.77   12.23  906.23 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     88.766      5.467   16.24   <2e-16 ***
## siteWaitemata  106.730      7.680   13.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 207.2 on 2909 degrees of freedom
## Multiple R-squared:  0.06226,    Adjusted R-squared:  0.06194 
## F-statistic: 193.1 on 1 and 2909 DF,  p-value: < 2.2e-16

HINT: Compare the Intercept estimate (88.77) of the linear model to the mean of the North Shore values (88.77). Now compare the sum of the Intercept estimate and the siteWaitemata coefficient (195.5) to the mean of the Waitemata values (195.5). From this we can see that for a factor covariate the Intercept estimate (also called the baseline) is the estimate of the mean at the baseline factor level (in this case in the North Shore) and that the coefficient for siteWaitemata is the increase or decrease in the average value of Ecoli at Waitemata over the North Shore.

Inference when there are more than two samples

The data used in this section is available here (right click and “Save As”). This .csv file contains the dates and daily averaged temperature values in Auckland (see this section to import this into your session). We are going to use the lubridate package to deal with dates in R. To install this package run install.packages("lubridate") (remember to run library(lubridate) before you use the functionality it contains.)

library(lubridate) ## load package
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
auck <- read.csv("../data/auckland.csv") ## read in data. This file path will be different for everyone
str(auck) ## let's have a look at the structure
## 'data.frame':    8768 obs. of  2 variables:
##  $ DATE: Factor w/ 8768 levels "1994-08-02","1994-08-03",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ TAVG: num  13.5 12.9 11.3 12.5 11.6 10.6 13.2 10.8 7 8.1 ...
## we're going to use the month function from lubridate to create another column of months
auck$MONTH <- month(auck$DATE,label = TRUE)
str(auck) ## notice the other data added
## 'data.frame':    8768 obs. of  3 variables:
##  $ DATE : Factor w/ 8768 levels "1994-08-02","1994-08-03",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ TAVG : num  13.5 12.9 11.3 12.5 11.6 10.6 13.2 10.8 7 8.1 ...
##  $ MONTH: Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 8 8 8 8 8 8 8 8 8 8 ...
boxplot(auck$TAVG ~ auck$MONTH) ## is there a difference between months (all other temporal info ignored)

## One-Way ANOVA 
anova <- aov(auck$TAVG ~ auck$MONTH) ## carry out anova
summary(anova)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## auck$MONTH    11  83199    7564    2176 <2e-16 ***
## Residuals   8749  30416       3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 7 observations deleted due to missingness

It seems that at least one month has a significantly different average value than at least one other. How do we find which ones? Try running TukeyHSD(anova), then plot(TukeyHSD(anova)).
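Those commands might look something like this; plot(TukeyHSD(anova)) draws a confidence interval for each pairwise difference between months, and intervals that do not cross zero suggest a significant difference:

```r
tukey <- TukeyHSD(anova) ## all pairwise comparisons between months
tukey                    ## print the estimated differences and their confidence intervals
plot(tukey)              ## intervals not crossing zero indicate significant differences
```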

Week Three “Homework”

For this quiz (Room Name STATSANDR) you will need to work with daily averaged temperatures in Christchurch (download and load these data into your R session as you did with the Auckland data above).

ANOVA Test: Definition, Types, Examples

Writing functions and coding your own hypothesis test (Fourth practical session)

Writing your own R function

Typically a function in R takes in one or more arguments, carries out some calculations, and returns a result.

Defining your own function is simple e.g.,

my_function <- function(x){
  ## this function sums the argument x and divides by its length
  ## to calculate the mean of x
  result <- sum(x)/length(x)
  return(result)
}
## Let's test this
## first of all create a vector of numbers
x <- c(1,2,9,3,4,5,6,7,5,4)
## test my_function above
my_function(x = x)
## [1] 4.6
## compare to the inbuilt mean() function
mean(x = x)
## [1] 4.6
## Now let's try a function with multiple arguments
another_function <- function(x,y, multiply = FALSE){
  if(multiply){ 
    ## if multiply = TRUE then the product of the arguments will be calculated
    result <- x*y
  }else{
    ## otherwise it'll be the sum
      result <- x + y
  }
  return(result)
}
## calculate the sum
another_function(x = 3, y = 4, multiply = FALSE)
## [1] 7
## calculate the product
another_function(x = 3, y = 4, multiply = TRUE)
## [1] 12
## Now let's try a function that also prints some text
another_function_text <- function(x,y, multiply = FALSE){
  if(multiply){ 
    ## if multiply = TRUE then the product of the arguments will be calculated
    text <- "The product of x and y is"
    result <- x*y
  }else{
    ## otherwise it'll be the sum
    text <- "The sum of x and y is"
    result <- x + y
  }
  cat(text,result, "\n") 
  ## the cat() function prints the objects to your console. "\n" makes the prompt skip to the next line
}
## calculate the sum
another_function_text(x = 3, y = 4, multiply = FALSE)
## The sum of x and y is 7
## calculate the product
another_function_text(x = 3, y = 4, multiply = TRUE)
## The product of x and y is 12

For more on writing functions in R

Lovebirds and Budgerigars

The data we will be using can be downloaded from here (once tab has opened right click and “Save As”). See this section to import this into your session.

The analysis below is part of a larger piece of work completed by Ben Stevenson (all code is publicly available here).

Code to produce the plot above can be found, courtesy of Ben Stevenson, [here](https://github.com/b-steve/macrorhabdus/blob/master/screening.r).

Permutation test

library(lubridate)
## Extracting variables.
date <- dmy(data$date)
lovebird <- data$lovebird
## Permutation test

## Turning dates into day progression.
day <- as.vector(date - date[1]) + 1
## Lovebird variables.
day.l <- day[!is.na(lovebird)]
lovebird <- lovebird[!is.na(lovebird)]
lovebird.std <- lovebird/sd(lovebird)

## Treatment day limits.
## treatment date range from 2017-05-02 to 2017-06-02
treat.lims <- day[c(18, 27)]
## All possible start- and end-date pairings.
start.perm <- 1:(max(day.l) - 31)
end.perm <- start.perm + 31

## Observed means for lovebird.
obs.l <- mean(lovebird[day.l >= treat.lims[1] & day.l <= treat.lims[2]])
obs.l.std <- mean(lovebird.std[day.l >= treat.lims[1] & day.l <= treat.lims[2]])
## Doing permutations for lovebird.
n.l <- length(start.perm)
means.l <- numeric(n.l)
means.l.std <- numeric(n.l)
for (i in 1:n.l){
    means.l[i] <- mean(lovebird[day.l >= start.perm[i] & day.l <= end.perm[i]])
    means.l.std[i] <- mean(lovebird.std[day.l >= start.perm[i] & day.l <= end.perm[i]])
}
## Getting rid of periods with no observations.
means.l <- means.l[is.finite(means.l)]
means.l.std <- means.l.std[is.finite(means.l.std)]
## Individual rank and p-value for lovebird.
sum(means.l <= obs.l)
## [1] 7
mean(means.l <= obs.l)
## [1] 0.06306306
## Standardised results should be the same.
sum(means.l.std <= obs.l.std)
## [1] 7
mean(means.l.std <= obs.l.std)
## [1] 0.06306306
## plot
hist(means.l.std, main = "", xlab = "Standardised means")
abline(v = obs.l.std,lwd = 2, lty = 2)
legend("top", bty = "n", legend = "Observed standardised mean", lty = 2, lwd = 2, cex = 0.7)

Week four “Homework”

  • Write an R function that carries out a permutation test as above taking
    • a vector of dates,
    • a corresponding vector of observations, and
    • a date range specifying treatment limits as arguments (the treatment date range is from 2017-05-02 to 2017-06-02).
  • This function should output a vector of permuted means.
  • Include an option to either carry out the permutation test on
    • the raw or
    • standardised values.
  • Use the function on the lovebird and budgie data to carry out two hypothesis tests that the observed standardised means are equal to any other permuted mean.
  • Extra: use this function to test the hypothesis that the observed sum of the budgie and lovebird standardised means during treatment is equal to any other possible permuted combinations. [HINT: look into the outer() function to compute all possible summations of the lovebird and budgie permuted means.]
#### define function
permutation_means <- function(dates, observations, date.range, raw = FALSE){
  # Turning dates into day progression.
  day <- as.vector(dates - dates[1]) + 1
  start.treat <- which(dates == min(date.range))
  end.treat <- which(dates == max(date.range))
  ##  variables.
  day <- day[!is.na(observations)]
  observations <- observations[!is.na(observations)]
  if(raw){
    observations <- observations
  }else{
    observations <- observations/sd(observations)
  }
  # Treatment day limits.
  treat.lims <- day[c(start.treat, end.treat)]
  ## All possible start- and end-date pairings.
  start.perm <- 1:(max(day) - 31)
  end.perm <- start.perm + 31
  ## Doing permutations for observations.
  n <- length(start.perm)
  means.perm <- numeric(n)
  for (i in 1:n){
     means.perm[i] <- mean(observations[day >= start.perm[i] & day <= end.perm[i]])
    
  }
  ## Getting rid of periods with no observations.
  means.perm <- means.perm[is.finite(means.perm)]
  return(means.perm)
}
#########
## The following assumes the screening data is a dataframe in your session named data
## required library
library(lubridate)
## Extracting variables.
date <- dmy(data$date)
lovebird <- data$lovebird
date.range <- c(date[18],date[27])
## calculate observed mean during treatment period for lovebirds
day.l <- date[!is.na(lovebird)]
lovebird <- lovebird[!is.na(lovebird)]
obs.mean <- mean(lovebird[day.l >= date.range[1] & day.l <= date.range[2]],na.rm = TRUE)
## carry out permutation
perm.means.raw <- permutation_means(date,data$lovebird,date.range = date.range, raw = TRUE)
## rank and p-value
sum(perm.means.raw <= obs.mean)
## [1] 7
mean(perm.means.raw <= obs.mean)
## [1] 0.06306306
## now for budgies
budgie <- data$budgie
## calculate observed mean during treatment period for budgies
day.b <- date[!is.na(budgie)]
budgie <- budgie[!is.na(budgie)]
obs.mean <- mean(budgie[day.b >= date.range[1] & day.b <= date.range[2]],na.rm = FALSE)
## carry out permutation
perm.means.raw <- permutation_means(day.b,budgie,date.range = date.range, raw = TRUE)
## rank and p-value
sum(perm.means.raw <= obs.mean)
## [1] 11
mean(perm.means.raw <= obs.mean)
## [1] 0.09401709
### standardised
## permuted means
means.l.std <- permutation_means(date,data$lovebird,date.range = date.range, raw = FALSE)
means.b.std <- permutation_means(date,data$budgie,date.range = date.range, raw = FALSE)
## Outer product of possible sums across all 31-day periods.
m <- outer(means.l.std, means.b.std, `+`)
## observed standardised mean
lovebird.std <- lovebird/sd(lovebird)
obs.l.std.mean <- mean(lovebird.std[day.l >= date.range[1] & day.l <= date.range[2]],na.rm = TRUE)
budgie.std <- budgie/sd(budgie)
obs.b.std.mean <- mean(budgie.std[day.b >= date.range[1] & day.b <= date.range[2]],na.rm = TRUE)
## Exact p-value.
mean(m <= obs.l.std.mean + obs.b.std.mean)
## [1] 0.02140602

The simple linear model (Fifth practical session)

For this section we will be using some data supplied along with base R. These data give the speed of cars and the distances taken to stop (data were recorded in the 1920s). To access this data use data(cars) and to read more about the data use ?cars. Try using some of the utility functions from above to inspect the data (e.g., head(cars), str(cars), plot(cars)).
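For example, a quick first look might use the base R utility functions mentioned above:

```r
data(cars)     ## load the built-in dataset
head(cars)     ## first six rows of speed and dist
str(cars)      ## structure: 50 observations of 2 numeric variables
summary(cars)  ## five-number summaries plus means for each column
```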

One question we might want to ask is: Does the stopping distance (ft) of cars relate to their speed (mph)? (in the 1920s…)

Note this question relates to the average relationship between a car’s stopping distance (ft) and its speed (mph). From a simple scatter plot it seems very likely the two are related; however, there is quite a bit of variability in stopping distance among cars travelling at the same speed (especially for faster cars).

plot(cars, xlab  = "speed (mph)", ylab = "stopping distance (ft)", pch = 20)

Let’s fit a linear model using the lm() function.

fit <- lm(dist ~ speed, data = cars)
summary(fit)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Recall the equation of a straight line is \[y = mx + c\] where

  • c is described as the intercept (the value of \(y\) when \(x = 0\)), and
  • m is described as the slope (in our case this is the rate of increase or decrease in stopping distance, on average, for every additional 1mph in speed).

In the case of a linear model it’s the expected value of the variable \(y\), \(E[y]\), that is assumed to follow a linear relationship with \(x\). Statisticians love Greek letters so this is typically written as \[E[y] = \beta_0 + \beta_1x.\]

The Estimate column gives us the coefficients \(\beta_0\) and \(\beta_1\), so from the output above it looks like the ‘best’ fit tells us that the predicted stopping distance (ft) of a car travelling at a given speed (mph) is given by the following equation \[\widehat{\text{stopping distance}} = -17.58 + 3.93\times\text{speed}.\] Here \(\widehat{\text{stopping distance}}\) denotes the estimated stopping distance (ft).
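The estimated coefficients can also be pulled straight out of the fitted model object with coef(), rather than reading them off the summary:

```r
fit <- lm(dist ~ speed, data = cars)  ## the model fitted above
coef(fit)                             ## named vector: (Intercept) and speed
round(coef(fit), 2)                   ## -17.58 and 3.93, matching the equation above
```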

Is there something we might want to be mindful of?

Let’s make some predictions for cars with speeds of 15 and 20 mph.

pred.speeds <- c(15,20)
## could do this manually using estimates to 2dp
pred <- function(pred.speed){
  -17.58 + 3.93*pred.speed
}
pred(pred.speeds)
## [1] 41.37 61.02
## or using an inbuilt function
predict(fit, newdata = data.frame(speed = pred.speeds), type = "response")
##        1        2 
## 41.40704 61.06908
## could also get prediction intervals
predict(fit, newdata = data.frame(speed = pred.speeds), type = "response", interval = "prediction",level = 0.95)
##        fit      lwr      upr
## 1 41.40704 10.17482 72.63925
## 2 61.06908 29.60309 92.53507

Let’s look at how the model does overall.

## plot raw data
plot(cars, xlab  = "speed (mph)", ylab = "stopping distance (ft)", pch = 20)
## get predictions and 95% confidence intervals for the data points
preds <- predict(fit, type = "response", interval = "confidence", level = 0.95)
## plot "line of best fit"
lines(cars$speed,preds[,1], lwd = 3)
## plot lower 95% CI bound
lines(cars$speed,preds[,2], lwd = 3, lty = 2)
## plot upper 95% CI bound
lines(cars$speed,preds[,3], lwd = 3,lty = 2)
## put legend
legend("topleft",bty = "n", lty = c(1,2), lwd = 3, legend = c("Model fit","95% confidence bounds"))

How well does the model account for the variation in the data? The \(R^2\) statistic gives us the proportion of variance explained. We can access this value using

summary(fit)$r.squared
## [1] 0.6510794
## which gives the proportion of variance explained, or alternatively
100*summary(fit)$r.squared
## [1] 65.10794
## is the % of total variation in stopping distance explained by the straight-line relationship.

Model checking

Our proposed model above can be written as \[y_i = \beta_0 + \beta_1x_i + \epsilon_i\] where \[\epsilon_i \overset{iid}{\sim} N(0,\sigma^2)\] for each car \(i = 1, ..., 50\). Here \(y\) denotes the stopping distance (ft) and \(x\) the speed (mph). The \(\epsilon_i\) is the random component that essentially “mops up” what’s left over from \(\beta_0 + \beta_1x_i\). We make some assumptions about \(\epsilon_i\) and these, in decreasing order of importance, are:

  • independence (the first “i” in iid)
  • identically distributed, with mean 0 and constant variance \(\sigma^2\) (called equality of variance)
  • Normality

Can we conclude that independence is reasonable (i.e., are cars behaving independently of one another)? I’d hope so! We can move on to the other two assumptions. Run plot(fit). The first plot shows the residuals versus the fitted values. Does the scatter look constant to you? The second plot is a Normal quantile-quantile plot. Recall we are assuming our residuals are Normally distributed, so there should be a 1-1 relationship between the standardised residuals and the quantiles of a Normal distribution. Is this the case?
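The individual diagnostic plots can also be requested one at a time via the which argument of plot():

```r
fit <- lm(dist ~ speed, data = cars)
plot(fit, which = 1)  ## residuals vs fitted values: look for constant scatter about zero
plot(fit, which = 2)  ## Normal Q-Q plot: points should lie close to the 1-1 line
```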

What about influence: are any observations “unduly” influencing the model estimates? We could use Cook’s distance to investigate this (see the fourth plot) or run

 barplot(cooks.distance(fit))

Observation 49 clearly dominates this plot. However, as a “rule of thumb”, an observation is “influential” if its Cook’s distance is greater than 0.4, or if removing the observation changes any parameter estimate by more than one standard error. The first is not the case here, but just to investigate further, re-run the model without observation 49 and see if the parameter estimates change by more than one standard error. [HINT: use cars.new <- cars[-49,] to create a new data frame without the 49th cars observation]
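One way that check might be carried out (a sketch, following the hint above) is to refit without observation 49 and express the change in each estimate in units of the original standard errors:

```r
fit <- lm(dist ~ speed, data = cars)          ## full-data fit
cars.new <- cars[-49, ]                       ## drop the 49th observation
fit.new <- lm(dist ~ speed, data = cars.new)  ## refit without it
## change in each estimate, scaled by the original standard errors;
## values with absolute size > 1 would flag observation 49 as influential
(coef(fit) - coef(fit.new)) / summary(fit)$coefficients[, "Std. Error"]
```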

Try a linear regression in R tutorial

Week five “Homework”

Using data of your own, where a linear relationship is likely, fit a linear model and go through the model checking process. Assess whether the model is appropriate. If you have any questions or get stuck please feel free to come and see me.

Modelling count data: introducing the generalised linear model (Sixth practical session)

Let’s consider the following data, where Number is the number of some event recorded in each year.

count <- data.frame(Year = 2009:2018, Number = c(1,5,3,15,35,62,134,798,1500,3591)) ## make our own example data
count
##    Year Number
## 1  2009      1
## 2  2010      5
## 3  2011      3
## 4  2012     15
## 5  2013     35
## 6  2014     62
## 7  2015    134
## 8  2016    798
## 9  2017   1500
## 10 2018   3591
plot(count$Year,count$Number,pch = 20) ## raw data

plot(count$Year,log(count$Number),pch = 20)

Using a log transformation, fit a linear model to the data where \[E[\text{log}(Number)] = \beta_0 + \beta_1\,\text{Year}\] (HINT: remember to log-transform the response). Have a look at the previous section for details on how to do this. Check your assumptions. What do you find? Are all the assumptions met? HINT: look at a plot of Cook’s distances.
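A sketch of that log-transformed fit (the log goes on the response inside the formula):

```r
count <- data.frame(Year = 2009:2018,
                    Number = c(1, 5, 3, 15, 35, 62, 134, 798, 1500, 3591))
log.fit <- lm(log(Number) ~ Year, data = count)  ## linear model on the log scale
summary(log.fit)                                 ## check the estimates
barplot(cooks.distance(log.fit))                 ## any unduly influential years?
```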

Rather than transforming the data (e.g., through using a log transformation) we could assume that the number of events each year, \(Y\), is a Poisson-distributed count variable. In this case, instead of our model being \[E[y] = \text{linear model},\] we could have \[E[y] = \text{exp(linear model)},\] which is the same as \[\text{log}(E[y]) = \text{linear model}.\] In this case we are not transforming \(Y\) but assuming its expected value has a multiplicative form.

So let’s redo the model you fitted above using a generalised linear model (GLM). This model will be \[E[Number] = \text{exp}(\beta_0 + \beta_1\,\text{Year}).\]

glm.fit <- glm(Number ~ Year, family = poisson, data = count)
summary(glm.fit)
## 
## Call:
## glm(formula = Number ~ Year, family = poisson, data = count)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7.1884  -1.5650  -0.2101   0.5158   8.1704  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.843e+03  2.457e+01  -75.04   <2e-16 ***
## Year         9.176e-01  1.218e-02   75.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 14677.78  on 9  degrees of freedom
## Residual deviance:   138.11  on 8  degrees of freedom
## AIC: 200.93
## 
## Number of Fisher Scoring iterations: 4
## let's have a look at the residuals
plot(glm.fit, which = 1)

Assumptions of a Poisson model include that the variances of the counts, \(Var(Y)\), are the same as their means, \(E[Y]\). If the model holds, the residual deviance \(\sim \chi^2_p\), where \(p\) is the residual degrees of freedom. The p-value associated with this in our case is

dev.resid <- summary(glm.fit)$deviance ## extract deviance
df <- summary(glm.fit)$df.residual ## residual degrees of freedom
1 - pchisq(dev.resid,df) ## calculate p-value
## [1] 0

Hmmm, we can reject the hypothesis that the residual deviance \(\sim \chi^2_8\). This indicates that our model is not appropriate. Is this a surprise? Look again at the raw data plot. Does it seem likely that \(Var(Y) = E[Y]\)?

Can we fix this? What if our equal mean-variance assumption were relaxed to \(Var(Y) = k\,E[Y]\), where \(k > 1\)? This can be done simply by assuming a “quasi”-Poisson distribution. To do so run

quasi.fit <- glm(Number ~ Year, family = quasipoisson, data = count)
## let's have a look at the summary
summary(quasi.fit)
## 
## Call:
## glm(formula = Number ~ Year, family = quasipoisson, data = count)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7.1884  -1.5650  -0.2101   0.5158   8.1704  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.843e+03  1.014e+02  -18.18 8.62e-08 ***
## Year         9.176e-01  5.027e-02   18.25 8.35e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 17.04166)
## 
##     Null deviance: 14677.78  on 9  degrees of freedom
## Residual deviance:   138.11  on 8  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
## and calculate some confidence intervals for the parameters
exp(confint.default(quasi.fit)) ## back transform manually
##                2.5 %   97.5 %
## (Intercept) 0.000000 0.000000
## Year        2.268288 2.762374

So this suggests that the annual number of events increases by between 2.27 and 2.76 times each year. Compare these estimates to those from the linear (on the transformed counts) and the Poisson model above.
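One way to make that comparison is to put the estimated multiplicative yearly increases side by side (log.fit here is the log-transformed linear model you fitted earlier in this section):

```r
count <- data.frame(Year = 2009:2018,
                    Number = c(1, 5, 3, 15, 35, 62, 134, 798, 1500, 3591))
log.fit <- lm(log(Number) ~ Year, data = count)                 ## log-linear fit
glm.fit <- glm(Number ~ Year, family = poisson, data = count)   ## Poisson GLM
quasi.fit <- glm(Number ~ Year, family = quasipoisson, data = count)
## estimated multiplicative yearly increase under each model;
## note the Poisson and quasi-Poisson point estimates coincide,
## only their standard errors differ
exp(c(log.linear = unname(coef(log.fit)["Year"]),
      poisson = unname(coef(glm.fit)["Year"]),
      quasipoisson = unname(coef(quasi.fit)["Year"])))
```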

Try a logistic regression in R tutorial

Week six “Homework”

Using data of your own, where the appropriate conditions are met, fit either a logistic or a Poisson model, going through the model checking process. Assess whether the model is appropriate. If you have any questions or get stuck please feel free to come and see me.

Spatial models in INLA (Seventh practical session)

To read more about the INLA project, see here.

First of all you’ll need to download the R-INLA package. To do so run install.packages("INLA", repos=c(getOption("repos"), INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE) (this might take a while).

library(INLA)

Data collected by regions

This section illustrates the fitting of disease mapping with count data (Scotland, UK). We use a Besag-York-Mollier model (BYM) for the spatial structure (latent effect). For more details see this tutorial.

data(Scotland)
head(Scotland)
##   Counts   E  X Region
## 1      9 1.4 16      1
## 2     39 8.7 16      2
## 3     11 3.0 10      3
## 4      9 2.5 24      4
## 5     15 4.3 10      5
## 6      8 2.4 24      6
source(system.file("demodata/Bym-map.R", package="INLA")) ## load function to plot map
Bym.map(Scotland$Counts)

To visualise the graph you may need to run the following code to install the Rgraphviz package.

source("http://bioconductor.org/biocLite.R")
biocLite("Rgraphviz")
library(Rgraphviz) ## needed to plot the graph
g = system.file("demodata/scotland.graph", package="INLA") ## get graph
graph <- inla.read.graph(g)
plot(graph)

## set up formula
formula = Counts ~ f(Region, model="bym", graph = g,
                          param=c(0.5,0.0005)) + I(X/10)
## fit model
mod.scotland = inla(formula,family="poisson",E = E,data = Scotland)
summary(mod.scotland)
## 
## Call:
## c("inla(formula = formula, family = \"poisson\", data = Scotland, ",  "    E = E)")
## 
## Time used:
##  Pre-processing    Running inla Post-processing           Total 
##          0.6943          0.2209          0.1954          1.1106 
## 
## Fixed effects:
##                mean     sd 0.025quant 0.5quant 0.975quant    mode kld
## (Intercept) -0.1155 0.1095    -0.3319  -0.1151     0.0985 -0.1144   0
## I(X/10)      0.2656 0.1138     0.0374   0.2671     0.4852  0.2701   0
## 
## Random effects:
## Name   Model
##  Region   BYM model 
## 
## Model hyperparameters:
##                                              mean        sd 0.025quant
## Precision for Region (iid component)     1006.096 1143.0216     41.134
## Precision for Region (spatial component)    2.419    0.7706      1.245
##                                          0.5quant 0.975quant   mode
## Precision for Region (iid component)      642.975   4119.465 94.272
## Precision for Region (spatial component)    2.307      4.246  2.098
## 
## Expected number of effective parameters(std dev): 27.30(3.286)
## Number of equivalent replicates : 2.051 
## 
## Marginal log-Likelihood:  -150.01
Bym.map(mod.scotland$summary.random$Region$mean)

Data collected at point locations

This section is based on this tutorial where we assume a latent Gaussian Markov Field for the spatial auto-correlation. See this material for further details.

First of all load some more required libraries. You may need to install them first by running install.packages("...").

library(sp)
library(fields)
library(geoR) ## for the ca20 data

We will be using data already in R

data('ca20')

Use the utility functions you’ve learnt to explore the data, and ?ca20 to learn about them. You’ll notice this dataset is stored as a list with named elements: the first element contains the coordinates, the second the response variable. We can still use the $ sign to access the elements.

For ease, and other reasons, we are going to scale the data. Notice we are using a linear transformation, which does not change the structure of the data.

df <- data.frame(y = ca20$data, locx = ca20[[1]][ , 1], locy = ca20[[1]][ , 2], ca20[[3]])
spatial.scaling <- 100
df$locx <- (df$locx - min(df$locx))/spatial.scaling
df$locy <- (df$locy - min(df$locy))/spatial.scaling
df$altitude <- df$altitude - mean(df$altitude)
df$y <- df$y - 50

Let’s make a quilt plot to look at the data

quilt.plot(x = df$locx,y = df$locy,z = df$y,nx = 40,ny = 40, col = terrain.colors(101),
           main = "Data")

Modelling

To inform the model of the spatial structure of the observations, a mesh is constructed. This is a discretization of the study area (essentially the area is divided up into small triangles, similar to creating a raster or a grid over space).

max.edge <- 0.5
mesh <- inla.mesh.2d(
  loc=df[ , c('locx', 'locy')],
  offset = c(0.5, 1.5),
  max.edge=c(max.edge, max.edge*3),
  # discretization accuracy
  cutoff=max.edge/5)

## let's have a look at the mesh
plot(mesh, asp=1)
points(df[ , c('locx', 'locy')], col='red', pch = 20) ## add the observation locations
axis(1)
axis(2) ## add axis ticks

We now need to connect the measurement locations to the mesh representation; for this we need the so-called A-matrix.

A = inla.spde.make.A(mesh=mesh, loc=data.matrix(df[ , c('locx', 'locy')]))

Create the covariate data frame. Note the intercept = 1 will automatically expand so that it is the same length as the altitude covariate.

covariates <- cbind(intercept = 1, altitude = df$altitude)

Create the data ‘stack’

stack <- inla.stack(tag='est',
                    # - Name (nametag) of the stack
                    # - Here: est for estimating
                    data=list(y = df$y),
                    effects=list(
                      # - The Model Components
                      s=1:mesh$n,
                      # - The "s" is means "spatial"
                     covariates = covariates),
                      # - The second is all fixed effects
                    A = list(A, 1)
                    # - First projector matrix is for 's'
                    # - second is for 'fixed effects'
                    )
prior.median.sd = 1; prior.median.range = 7

spde = inla.spde2.pcmatern(mesh, prior.range = c(prior.median.range, .5), prior.sigma = c(prior.median.sd, .5), constr = T)

formula = y ~ -1 + covariates + f(s, model=spde)

prior.median.gaus.sd = 5.5

family = 'gaussian'
control.family = list(hyper = list(prec = list(
  prior = "pc.prec", fixed = FALSE, param = c(prior.median.gaus.sd,0.5))))
res <- inla(formula, data=inla.stack.data(stack),
            control.predictor=list(A = inla.stack.A(stack), compute=T),
            # compute=T to get posterior for fitted values
            family = family,
            control.family = control.family,
            #control.compute = list(config=T, dic=T, cpo=T, waic=T), 
            # - Model comparisons
            control.inla = list(int.strategy='eb'),
            # - faster computation
            #control.inla = list(int.strategy='grid'),
            # - More accurate integration over hyper-parameters
            verbose=F) ## will take a few seconds
summary(res)
## 
## Call:
## c("inla(formula = formula, family = family, data = inla.stack.data(stack), ",  "    verbose = F, control.predictor = list(A = inla.stack.A(stack), ",  "        compute = T), control.family = control.family, control.inla = list(int.strategy = \"eb\"))" )
## 
## Time used:
##  Pre-processing    Running inla Post-processing           Total 
##          1.8105          2.9530          0.9414          5.7049 
## 
## Fixed effects:
##               mean     sd 0.025quant 0.5quant 0.975quant   mode kld
## covariates1 0.2116 1.8206    -3.3630   0.2115     3.7831 0.2116   0
## covariates2 2.1414 1.8530    -1.4966   2.1414     5.7764 2.1414   0
## 
## Random effects:
## Name   Model
##  s   SPDE2 model 
## 
## Model hyperparameters:
##                                           mean     sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.0276 0.0070     0.0167   0.0266
## Range for s                             3.1859 0.8669     1.7879   3.0873
## Stdev for s                             9.4461 1.5645     6.8041   9.2926
##                                         0.975quant   mode
## Precision for the Gaussian observations      0.044 0.0248
## Range for s                                  5.172 2.9002
## Stdev for s                                 12.944 8.9751
## 
## Expected number of effective parameters(std dev): 63.61(0.00)
## Number of equivalent replicates : 2.798 
## 
## Marginal log-Likelihood:  -647.85 
## Posterior marginals for linear predictor and fitted values computed

Create a function to plot spatial effects

local.plot.field = function(field, mesh, xlim=c(0,11), ylim=c(0,9), ...){
  stopifnot(length(field) == mesh$n)
  # - error when using the wrong mesh
  proj = inla.mesh.projector(mesh, xlim = xlim, 
                             ylim = ylim, dims=c(300, 300))
  # - Can project from the mesh onto a 300x300 plotting grid 
  field.proj = inla.mesh.project(proj, field)
  # - Do the projection
  image.plot(list(x = proj$x, y=proj$y, z = field.proj), 
             xlim = xlim, ylim = ylim, col = terrain.colors(100), ...)  
}

Plot the posterior mean of the spatial effect

local.plot.field(res$summary.random[['s']][['mean']], mesh)
axis(1)
axis(2)

Plot uncertainty

local.plot.field(res$summary.random$s$sd, mesh)
axis(1)
axis(2)

Plot fitted values

quilt.plot(x = df$locx,y = df$locy,z = res$summary.fitted.values$mean[1:nrow(df)],nx = 40,ny = 40, col = terrain.colors(100), main="Fitted values", 
           zlim = range(df$y))

Compare this to the quilt plot of the observed data above.

Going further with INLA

Week seven “Homework”

Using appropriate spatial data of your own use INLA to fit either of the models above. If you have any questions or get stuck please feel free to come and see me.